setwd('~/')
setwd('~/Documents/UdacityDAND/EDAFinalProject')
ww <- read.csv('wineQualityWhites.csv')
# install.packages('psych')
library(GGally)
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(gridExtra)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
##
## contrasts, contr.sum, contr.treatment
## The following object is masked from 'package:base':
##
## as.array
head(ww)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
I aim to examine the data and the variables and attributes it contains.
This report explores a dataset of 4,898 white wines with 13 columns, including 11 variables that quantify the chemical properties of each wine.
nrow(ww)
## [1] 4898
ncol(ww)
## [1] 13
str(ww)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
summary(ww)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
qplot(x = fixed.acidity, data = ww, binwidth = .1) +
scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5))
## Warning: Removed 9 rows containing non-finite values (stat_bin).
The distribution is roughly normal and unimodal, with fixed acidity peaking around 6.8. A few outliers below a fixed acidity of 4 and above 10 have been excluded from the plot. According to Waterhouse, most wines have a tartaric acid value between 1 g/dm^3 and 4 g/dm^3. Is there a strong correlation between fixed acidity and pH value? Now let’s explore what the plots look like for the other variables.
qplot(x = volatile.acidity, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
This distribution is also unimodal, peaking around a volatile acidity value of 0.28. Waterhouse states that the average acetic acid value is less than 400 mg/L, which is consistent with our dataset. The legal limit of acetic acid in the US for white wine is 1.1 g/dm^3. Too much acetic acid can result in unpleasant aromas. In addition to causing undesirable aromas, both acetic acid and acetaldehyde are toxic to Saccharomyces cerevisiae and may lead to stuck fermentations.
qplot(x = citric.acid, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
This distribution is also roughly normal, with citric acid peaking around 0.3. Why is there a sudden peak at around 0.49?
According to Waterhouse, one would expect to see 0 to 500 mg/L of citric acid. This might be why there is a peak at around 0.49-0.5.
qplot(x = residual.sugar, data = ww, binwidth = .1) +
scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## Warning: Removed 5 rows containing non-finite values (stat_bin).
I observe a long-tailed distribution. There are some extreme outliers above 25 (the maximum is 65.8), which have been excluded from the graph. According to winefolly.com:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine
We can conclude that most of the wines in the data set are dry wines.
A wine is dry when the yeast consumes all the available sugar and produces ethanol as a by-product. This is why some sweet wines have less alcohol than their dry counterparts. We can look at the correlation between residual sugar content and alcohol. Is this an inverse relationship?
# A second x scale replaces the first, so only the log10 scale is kept
qplot(x = residual.sugar, data = ww) +
scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The transformed distribution is bimodal, with one peak around 4 and another around 9. What do these peaks represent?
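To make the effect of the log10 scale concrete, here is a small sketch applied to the minimum, median, and maximum residual sugar reported by `summary(ww)`. The variable names are illustrative only and are not part of the original analysis.

```r
# Minimum, median and maximum residual sugar (g/dm^3), copied from summary(ww).
sugar_summary <- c(min = 0.6, median = 5.2, max = 65.8)

# On the raw scale the maximum sits far away from the bulk of the data.
raw_ratio <- unname(sugar_summary["max"] / sugar_summary["median"])

# log10 compresses the long right tail into a much narrower range.
log_sugar <- log10(sugar_summary)
log_range <- unname(diff(range(log_sugar)))

print(round(raw_ratio, 2))
print(round(log_sugar, 3))
print(round(log_range, 3))
```

Because the whole range spans only about two units on the log10 scale, the bulk of the data and the tail can be displayed in the same histogram.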
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = chlorides, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))
## Warning: Removed 110 rows containing non-finite values (stat_bin).
The majority of the values lie between 0 and 0.1. This is also a roughly normal distribution, with a peak at around 0.04. Most wines have a salt content of less than 0.1.
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = free.sulfur.dioxide, data = ww, binwidth = 10) +
scale_x_continuous(limits = c(0, 150), breaks = seq(0, 150, 10))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Free sulfur dioxide appears roughly normally distributed, with a peak at approximately 30. Most wines have a free sulfur dioxide content of less than 100.
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = total.sulfur.dioxide, data = ww, binwidth = 30) +
scale_x_continuous(limits = c(0, 320), breaks = seq(0, 320, 20))
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Total sulfur dioxide appears roughly normally distributed, with a peak in the 120s. Sulfites are used to preserve wines. Most people can easily digest sulfites, but some people have extreme allergic reactions to them. According to Waterhouse, the average sulfite content in wine is around 80 mg/L; the mean total sulfur dioxide in this dataset is higher, at 138.4. SO2 content above 50 is detectable in the nose and taste of wine. Given this, there are many wines in the dataset where SO2 content might become evident in the nose and taste.
qplot(x = density, data = ww, binwidth = .001) +
scale_x_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005))
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Density seems to follow a normal distribution with peak at nearly 0.992. There are a few outliers as well.
qplot(x = pH, data = ww, binwidth = .05) +
scale_x_continuous(limits = c(2.7, 4), breaks = seq(2.7, 4, .1))
pH seems to follow a normal distribution, with a peak near 3.15. According to Dr. Vinny’s post on winespectator.com, the ideal pH value for white wines is around 3.0-3.4.
qplot(x = sulphates, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(0.2, 1.1, .05))
This is a roughly normal distribution with a peak at 0.5. Potassium sulphate is an additive that can contribute to sulfur dioxide gas, which acts as an antimicrobial and antioxidant.
qplot(x = alcohol, data = ww, binwidth = .1) +
scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5))
White wines have a distribution between 8.5% and 14%, with concentration between 9% and 10.5%.
# Draw a single bar layer; qplot() plus geom_bar() overlaid a histogram and a bar chart
ggplot(aes(x = quality), data = ww) +
geom_bar()
Most of the wines are given a quality score of 6. These values might be biased in many ways, as they are sensory data and completely subjective. The scores might differ if a different set of experts were used.
Let’s look at all variable values by quality:
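As a quick check of how concentrated the scores are, the sketch below computes the share of each quality score. The counts are copied from the `n` column of the grouped `winesByQuality` summary printed later in this report; the variable names are illustrative.

```r
# Wines per quality score, copied from the winesByQuality table in this report.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Share of each score among all wines.
quality_share <- quality_counts / sum(quality_counts)

# Most frequent score and the combined share of scores 5 and 6.
most_common <- names(which.max(quality_counts))
share_5_6 <- sum(quality_share[c("5", "6")])

print(round(quality_share, 3))
print(most_common)
print(round(share_5_6, 3))
```

The extreme scores (3 and 9) are rare, so any pattern observed for those groups rests on very few wines.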
qplot(x = fixed.acidity, data = ww, binwidth = .1) +
scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5)) +
facet_wrap(~ quality, nrow =5)
## Warning: Removed 9 rows containing non-finite values (stat_bin).
The fixed acidity (tartaric acid) for wines of different quality peaks between 6 and 8 g/L.
qplot(x = volatile.acidity, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1)) +
facet_wrap(~ quality)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
qplot(x = citric.acid, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1)) +
facet_wrap(~ quality)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
qplot(x = residual.sugar, data = ww, binwidth = .1) +
scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2)) +
facet_wrap(~ quality)
## Warning: Removed 5 rows containing non-finite values (stat_bin).
qplot(x = residual.sugar, data = ww) +
scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2)) +
scale_x_log10() +
facet_wrap(~quality)
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = chlorides, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01)) +
facet_wrap(~ quality)
## Warning: Removed 110 rows containing non-finite values (stat_bin).
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = free.sulfur.dioxide, data = ww, binwidth = 10) +
scale_x_continuous(limits = c(0, 150), breaks = seq(0, 150, 10)) +
facet_wrap(~ quality)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
# 'bin' is not a valid argument; 'binwidth' sets the bin width
qplot(x = total.sulfur.dioxide, data = ww, binwidth = 30) +
scale_x_continuous(limits = c(0, 320), breaks = seq(0, 320, 20)) +
facet_wrap(~ quality)
## Warning: Removed 3 rows containing non-finite values (stat_bin).
qplot(x = density, data = ww, binwidth = .001) +
scale_x_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005)) +
facet_wrap(~ quality)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 7 rows containing missing values (geom_bar).
qplot(x = pH, data = ww, binwidth = .05) +
scale_x_continuous(limits = c(2.7, 4), breaks = seq(2.7, 4, .1)) +
facet_wrap(~ quality)
qplot(x = sulphates, data = ww, binwidth = .01) +
scale_x_continuous(limits = c(0.2, 1.1), breaks = seq(0.2, 1.1, .05)) +
facet_wrap(~ quality)
qplot(x = alcohol, data = ww, binwidth = .1) +
scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5)) +
facet_wrap(~ quality)
# Classify each wine by residual sugar (g/L) using the winefolly.com ranges
ww$dryness <- ifelse(ww$residual.sugar < 1, "Bone Dry",
ifelse(ww$residual.sugar < 10, "Dry",
ifelse(ww$residual.sugar < 35, "Off Dry",
ifelse(ww$residual.sugar < 120, "Sweet", "Very Sweet"))))
# qplot() already draws bars for a discrete x, so no extra geom_bar() is needed
qplot(x = dryness, data = ww)
# Univariate Analysis
The data set consists of 4,898 variants of the Portuguese white wine “Vinho Verde”, with measurements of eleven chemical properties:
- Fixed Acidity: acid that contributes to the conservation of wine.
- Volatile Acidity: amount of acetic acid in wine, which at high levels can lead to an unpleasant taste of vinegar.
- Citric Acid: found in small amounts, can add “freshness” and flavor to wines.
- Residual Sugar: amount of sugar remaining after the end of the fermentation.
- Chlorides: amount of salt in wine.
- Free Sulfur Dioxide: prevents the growth of microbes and the oxidation of the wine.
- Total Sulfur Dioxide: amount of free and bound forms of SO2, which at high concentrations becomes evident in the aroma and taste of the wine.
- Density: density of the wine, which depends on the percentage of alcohol and the amount of sugar.
- pH: describes how acidic or basic a wine is on a scale of 0 to 14.
- Sulphates: additive that acts as an antimicrobial and antioxidant.
- Alcohol: percentage of alcohol present in the wine.
And a sensorial property:
- Quality: grade between 0 and 10 given by specialists.
Observations:
- Most wines have medium quality (5 and 6).
- There’s no evident predictor of quality from the univariate analysis.
The main feature in the data set is quality, which is also our dependent variable. I’d like to determine which features are best for predicting the quality of wine. I suspect that some combination of the chemical property variables can be used to build a predictive model of the quality of white wines.
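One way to judge whether a combination of variables predicts better than a single one is to compare nested linear models. The sketch below is a minimal illustration on simulated data; the data frame `toy` and its columns `x1`, `x2`, and `y` are invented for the example and are not the wine measurements.

```r
# Simulated data for illustration only; these are not the wine measurements.
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n)

# Start with one predictor, then add a second one with update().
fit1 <- lm(y ~ x1, data = toy)
fit2 <- update(fit1, ~ . + x2)

# R-squared never decreases when a predictor is added to a nested OLS model,
# so adjusted R-squared and an F-test are more informative comparisons.
r2 <- c(fit1 = summary(fit1)$r.squared, fit2 = summary(fit2)$r.squared)
adj_r2 <- c(fit1 = summary(fit1)$adj.r.squared,
            fit2 = summary(fit2)$adj.r.squared)

print(round(r2, 3))
print(round(adj_r2, 3))
print(anova(fit1, fit2))
```

The same pattern applies to any response and predictors: fit the smaller model, extend it with `update()`, and check whether the added term improves the fit by more than chance would suggest.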
It is very difficult to predict quality from the given variable at first glance. I did not notice any significant relationship even after facet wrapping various variables according to quality. Perhaps I could investigate further by taking residual sugar relations with other properties as a starting point to further my investigation.
I created a new variable called dryness, based on the residual sugar content as follows:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine
Most of the wines are Dry in nature.
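The same classification can also be expressed with `cut()`, which returns an ordered set of levels rather than plain strings. This is an alternative sketch, not the code used in this report; `classify_dryness` is a hypothetical helper name, and the example values are taken from the `head()` and `summary()` output above.

```r
# Hypothetical helper: map residual sugar (g/L) to the sweetness classes above.
classify_dryness <- function(sugar) {
  cut(sugar,
      breaks = c(-Inf, 1, 10, 35, 120, Inf),
      labels = c("Bone Dry", "Dry", "Off Dry", "Sweet", "Very Sweet"),
      right = FALSE)  # intervals are [lower, upper), matching the ifelse chain
}

# Residual sugar values taken from the head() and summary() output above.
example_sugar <- c(0.6, 1.6, 6.9, 20.7, 65.8)
print(data.frame(sugar = example_sugar,
                 dryness = classify_dryness(example_sugar)))
```

Because the result is a factor with levels in sweetness order, a bar chart of it would show the classes from driest to sweetest instead of alphabetically.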
It was necessary to exclude anomalies and extreme values in some cases for better visualisations. Some properties, such as residual sugar and density, had extreme values. In addition, the residual sugar of the white wines showed a long-tailed distribution. I used a log10 transformation and obtained a bimodal distribution.
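Percentile-based axis limits are one way to exclude extreme values from a plot without choosing cut-offs by hand. The sketch below assumes a hypothetical helper, `central_limits`, and uses a toy vector rather than the wine data.

```r
# Hypothetical helper: axis limits that cover the central bulk of a variable.
central_limits <- function(x, lower = 0.01, upper = 0.99) {
  unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
}

# Toy vector for illustration only: 100 typical values plus one extreme value.
toy <- c(1:100, 1000)
limits <- central_limits(toy)

# Number of observations that fall outside the plotting window.
n_outside <- sum(toy < limits[1] | toy > limits[2])

print(limits)
print(n_outside)
```

In ggplot2, passing such limits to `scale_x_continuous()` drops the rows outside the window before any statistics are computed, which is the source of the "Removed N rows" warnings in this report. `coord_cartesian()` zooms to the same window without dropping rows.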
ggpairs(ww, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pairs.panels(ww[,-5],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
ggplot(aes(x = alcohol, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
ggplot(aes(x = alcohol, y = quality), data = ww) +
geom_jitter(alpha = 1/5)
ggplot(aes(x = fixed.acidity, y = quality), data = ww) +
geom_point(alpha = 1/5) +
scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5)) + geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
ggplot(aes(x = fixed.acidity, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(4, 10), breaks = seq(4, 10, .5))
## Warning: Removed 10 rows containing missing values (geom_point).
ggplot(aes(x = volatile.acidity, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(aes(x = volatile.acidity, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(aes(x = citric.acid, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(aes(x = citric.acid, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .1))
## Warning: Removed 14 rows containing missing values (geom_point).
ggplot(aes(x = residual.sugar, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(aes(x = residual.sugar, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits= c(0, 25), breaks = seq(0, 25, 2))
## Warning: Removed 5 rows containing missing values (geom_point).
ggplot(aes(x = chlorides, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 110 rows containing non-finite values (stat_smooth).
## Warning: Removed 110 rows containing missing values (geom_point).
# Jittered version of the chlorides plot (the residual sugar plot was repeated here by mistake)
ggplot(aes(x = chlorides, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0, 0.1), breaks = seq(0, .1, .01))
ggplot(aes(x = free.sulfur.dioxide, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0, quantile(ww$free.sulfur.dioxide, .99)), breaks = seq(0, 150, 10))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 43 rows containing missing values (geom_point).
ggplot(aes(x = free.sulfur.dioxide, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0, quantile(ww$free.sulfur.dioxide, .99)), breaks = seq(0, 150, 10))
## Warning: Removed 47 rows containing missing values (geom_point).
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0, quantile(ww$total.sulfur.dioxide, .99)), breaks = seq(0, 250, 20))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0, quantile(ww$total.sulfur.dioxide, .99)), breaks = seq(0, 250, 20))
## Warning: Removed 51 rows containing missing values (geom_point).
ggplot(aes(x = density, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(.985, quantile(ww$density, .99)), breaks = seq(.985, 1.015, .005))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).
ggplot(aes(x = density, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(.985, quantile(ww$density, .99)), breaks = seq(.985, 1.015, .005))
## Warning: Removed 49 rows containing missing values (geom_point).
ggplot(aes(x = pH, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(2.7, quantile(ww$pH, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 43 rows containing missing values (geom_point).
ggplot(aes(x = pH, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(2.7, quantile(ww$pH, .99)))
## Warning: Removed 46 rows containing missing values (geom_point).
ggplot(aes(x = sulphates, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(0.2, quantile(ww$sulphates, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 48 rows containing non-finite values (stat_smooth).
## Warning: Removed 48 rows containing missing values (geom_point).
ggplot(aes(x = sulphates, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(0.2, quantile(ww$sulphates, .99)))
## Warning: Removed 50 rows containing missing values (geom_point).
ggplot(aes(x = alcohol, y = quality), data = ww) +
geom_point(alpha = 1/5) +
geom_smooth() +
scale_x_continuous(limits = c(8, quantile(ww$alcohol, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 41 rows containing non-finite values (stat_smooth).
## Warning: Removed 41 rows containing missing values (geom_point).
ggplot(aes(x = alcohol, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(8, quantile(ww$alcohol, .99)))
## Warning: Removed 50 rows containing missing values (geom_point).
ww$total.acidity <- ww$fixed.acidity + ww$volatile.acidity
ggplot(aes(x = total.acidity, y = quality), data = ww) +
geom_point() +
geom_smooth() +
scale_x_continuous(limits = c(4, quantile(ww$total.acidity, .99)))
## `geom_smooth()` using method = 'gam'
## Warning: Removed 49 rows containing non-finite values (stat_smooth).
## Warning: Removed 49 rows containing missing values (geom_point).
ggplot(aes(x = total.acidity, y = quality), data = ww) +
geom_jitter(alpha = 1/5) +
scale_x_continuous(limits = c(4, quantile(ww$total.acidity, .99)))
## Warning: Removed 49 rows containing missing values (geom_point).
It is clear from the plots above that alcohol has the strongest correlation with quality. Here are the noteworthy correlations involving quality. I had to use the integer version of the quality variable in order to calculate the correlations.
- Quality and alcohol: 0.436
- Quality and density: -0.307
However, both these correlations can’t be considered strong.
Let’s take a look at boxplots involving quality.
#Create a box plot for each variable, with one box per quality score
#(factor(quality) makes x discrete, so ggplot2 groups the boxes by score)
qp1 <- qplot(x = factor(quality), y = fixed.acidity, data = ww,
geom = 'boxplot')
qp2 <- qplot(x = factor(quality), y = volatile.acidity, data = ww,
geom = 'boxplot')
qp3 <- qplot(x = factor(quality), y = citric.acid, data = ww,
geom = 'boxplot')
qp4 <- qplot(x = factor(quality), y = residual.sugar, data = ww,
geom = 'boxplot')
qp5 <- qplot(x = factor(quality), y = chlorides, data = ww,
geom = 'boxplot')
qp6 <- qplot(x = factor(quality), y = free.sulfur.dioxide, data = ww,
geom = 'boxplot')
qp7 <- qplot(x = factor(quality), y = total.sulfur.dioxide, data = ww,
geom = 'boxplot')
qp8 <- qplot(x = factor(quality), y = density, data = ww,
geom = 'boxplot')
qp9 <- qplot(x = factor(quality), y = pH, data = ww,
geom = 'boxplot')
qp10 <- qplot(x = factor(quality), y = sulphates, data = ww,
geom = 'boxplot')
qp11 <- qplot(x = factor(quality), y = alcohol, data = ww,
geom = 'boxplot')
grid.arrange(qp1,qp2,qp3,qp4,qp5,qp6,qp7,qp8,qp9,qp10,qp11)
#Create a box plot for variables with highest correlation
grid.arrange(qp8,qp11)
#Group the data by quality and then summarize by density
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:memisc':
##
## collect, recode, rename
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
quality.groups <- group_by(ww, quality)
winesByQuality <- summarize(quality.groups, mean_density = mean(density),
median_density = median(as.numeric(density)),
min_density = min(density),
max_density = max(density),
n = n())
winesByQuality
## # A tibble: 7 x 6
## quality mean_density median_density min_density max_density n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 0.9948840 0.994425 0.99110 1.00010 20
## 2 4 0.9942767 0.994100 0.98920 1.00040 163
## 3 5 0.9952626 0.995300 0.98722 1.00241 1457
## 4 6 0.9939613 0.993660 0.98758 1.03898 2198
## 5 7 0.9924524 0.991760 0.98711 1.00040 880
## 6 8 0.9922359 0.991640 0.98713 1.00060 175
## 7 9 0.9914600 0.990300 0.98965 0.99700 5
The median data again show that as quality increases, density values decrease.
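The trend in the medians can be checked numerically. The sketch below copies the quality scores and median densities from the `winesByQuality` table above and computes a rank correlation across the seven groups. Note that this is a group-level figure, not the row-level correlation between density and quality, and the variable names are illustrative.

```r
# Quality scores and median densities copied from the winesByQuality table.
by_quality <- data.frame(
  quality = 3:9,
  median_density = c(0.994425, 0.994100, 0.995300, 0.993660,
                     0.991760, 0.991640, 0.990300)
)

# Spearman's rho measures whether the medians fall monotonically with quality.
rho <- cor(by_quality$quality, by_quality$median_density, method = "spearman")

# Score-to-score change in median density (negative means a decrease).
step_change <- diff(by_quality$median_density)

print(round(rho, 3))
print(step_change)
```

The decrease is not strictly monotonic: the median rises from score 4 to score 5 before falling steadily. The groups for scores 3 and 9 contain only 20 and 5 wines, so their medians are less reliable.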
In addition to evaluating the correlations related to quality, I also want to probe how other variables work with each other. Here are the correlations of note that do not involve quality:
- Total sulfur dioxide and residual sugar: 0.401
- Total sulfur dioxide and free sulfur dioxide: 0.616
- Total sulfur dioxide and alcohol: -0.449
- Density and residual sugar: 0.839
- Alcohol and density: -0.780
- Residual sugar and alcohol: -0.451
- Fixed acidity and pH: -0.426
Density, alcohol, and residual sugar all appear to be strongly correlated to each other, so I am going to take a closer look at those plots.
dn1 <- ggplot(aes(x = density, y = residual.sugar), data = ww) +
geom_point(alpha = 1/5) +
xlim(quantile(ww$density, 0.01),
quantile(ww$density, 0.99)) +
ylim(quantile(ww$residual.sugar, 0.01),
quantile(ww$residual.sugar, 0.99))
ac1 <- ggplot(aes(x = alcohol, y = density), data = ww) +
geom_jitter(alpha = 1/5) +
xlim(quantile(ww$alcohol, 0.01),
quantile(ww$alcohol, 0.99)) +
ylim(quantile(ww$density, 0.01),
quantile(ww$density, 0.99))
sg1 <- ggplot(aes(x = residual.sugar, y = alcohol), data = ww) +
geom_point(alpha = 1/5) +
xlim(quantile(ww$residual.sugar, 0.01),
quantile(ww$residual.sugar, 0.99)) +
ylim(quantile(ww$alcohol, 0.01),
quantile(ww$alcohol, 0.99))
grid.arrange(dn1, ac1, sg1)
## Warning: Removed 160 rows containing missing values (geom_point).
## Warning: Removed 201 rows containing missing values (geom_point).
## Warning: Removed 157 rows containing missing values (geom_point).
The correlations are very evident in the charts shown above. Sugar must be denser than the other ingredients in the wine, because higher density levels go with higher sugar quantities. Similarly, higher alcohol levels go with lower density. Lastly, alcohol and sugar may offset each other during the wine-making process, because wines with lower levels of alcohol tend to have higher levels of sugar (and vice versa).
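For readers who want to see how such pairwise correlations are computed, here is a minimal sketch using only the six rows printed by `head(ww)` at the top of this report. Six rows are far too few to reproduce the full-data figures, so the numbers will differ from those quoted above; only the signs are expected to agree.

```r
# The six rows printed by head(ww); far too few to represent the full data set.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# Pairwise Pearson correlations among the three variables.
cor_matrix <- cor(first_rows, method = "pearson")
print(round(cor_matrix, 3))
```

Running the same `cor()` call on the corresponding columns of the full data frame would give the figures listed in this report.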
I also want to make a special note about pH levels and acidity. The acidity variables are related to pH, most clearly fixed acidity (correlation of -0.426). This is logical, as a higher pH value corresponds to lower acidity.
I evaluated all the variables against our main feature, quality, and observed that alcohol content has the strongest relationship with it, although the correlation is still only moderate. Another variable that may slightly influence quality is density.
At low alcohol levels, quality decreases as alcohol content increases; beyond that point, quality increases with alcohol content. As the smoothing line shows, this relationship is not linear.
I discovered strong correlations between alcohol, residual sugar and density. As alcohol content increases, density tends to decrease rather linearly. Furthermore, as residual sugar increases, density also increases, and a linear model fits this well. Finally, as the residual sugar level rises, the alcohol level decreases. This was clarified by the literature available online; I mainly referred to the material provided by Waterhouse.
The strongest correlation was seen between Density and Residual Sugar.
ww$quality.cat <- factor(ww$quality)
ggplot(aes(x = alcohol, y = density, color = quality.cat), data = ww) +
geom_point(size = 1, position = 'jitter') +
scale_color_brewer(type = 'seq',
guide = guide_legend(title = 'Quality', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(limits = c(8, 14.5), breaks = seq(8, 14.5, .5)) +
scale_y_continuous(limits = c(.985, 1.015), breaks = seq(.985, 1.015, .005))
## Warning: Removed 3 rows containing missing values (geom_point).
You can see that the graph generally gets darker to the right, so the correlations between alcohol and quality and between density and quality are evident.
ww$free.sulfur.dioxide.cat <- ifelse(ww$free.sulfur.dioxide <= 50, '<= 50mg/l, not noticeable', '> 50mg/l, noticeable')
ww$free.sulfur.dioxide.cat <- factor(ww$free.sulfur.dioxide.cat)
ggplot(ww, aes(quality.cat, alcohol, fill = free.sulfur.dioxide.cat)) +
geom_jitter(alpha = 0.1) +
geom_boxplot()
Given the same quality, wine without a noticeable sulfur aroma is more likely to have a higher alcohol level. For instance, among wines with a quality score of 6 and no sulfur smell, the median alcohol by volume is 10.6%, compared with 9.6% among wines with the same quality score and an evident sulfur smell (the blue boxplots). Therefore, you are more likely to get a better quality wine if the sulfur level is unnoticeable.
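Medians like the ones quoted above can be computed per group with base R's `aggregate()`. The sketch below uses a small toy data frame whose values are invented for illustration and are not rows from the wine data; the column names `free_so2` and `so2_noticeable` are also hypothetical.

```r
# Toy values for illustration only; these are not rows from the wine data.
toy <- data.frame(
  quality  = c(6, 6, 6, 6, 7, 7, 7, 7),
  free_so2 = c(30, 45, 60, 75, 20, 40, 55, 90),
  alcohol  = c(11.0, 10.0, 9.5, 9.0, 12.0, 11.0, 10.5, 10.0)
)

# Flag wines above the 50 mg/L free SO2 threshold used in this report.
toy$so2_noticeable <- toy$free_so2 > 50

# Median alcohol for each combination of quality score and SO2 flag.
medians <- aggregate(alcohol ~ quality + so2_noticeable, data = toy,
                     FUN = median)
print(medians)
```

Applied to the real data frame, the same formula with `quality.cat` and `free.sulfur.dioxide.cat` as grouping variables would give the medians drawn as the centre lines of the boxplots.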
ggplot(ww,aes(quality.cat, alcohol)) +
geom_boxplot(aes(fill= free.sulfur.dioxide.cat), alpha = 0.5) +
theme(legend.position=c(1,1),legend.justification=c(1,1)) +
xlab('Wine Quality') +
ylab('Alcohol (% by volume)')
ggplot(ww, aes(quality, fill= free.sulfur.dioxide.cat)) +
geom_density(alpha=.5) +
theme(legend.position = "none") +
xlab('Wine Quality')
m1 <- lm(quality ~ alcohol, data = ww)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + chlorides)
m4 <- update(m3, ~ . + fixed.acidity)
m5 <- update(m4, ~ . + volatile.acidity)
m6 <- update(m5, ~ . + pH)
m7 <- update(m6, ~ . + total.sulfur.dioxide)
m8 <- update(m7, ~ . + log(residual.sugar))
m9 <- update(m8, ~ . + citric.acid)
m10 <- update(m9, ~ . + free.sulfur.dioxide)
m11 <- update(m10, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww)
## m2: lm(formula = quality ~ alcohol + density, data = ww)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = ww)
## m4: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity,
## data = ww)
## m5: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity, data = ww)
## m6: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH, data = ww)
## m7: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide, data = ww)
## m8: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar),
## data = ww)
## m9: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid, data = ww)
## m10: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid + free.sulfur.dioxide, data = ww)
## m11: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity +
## volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) +
## citric.acid + free.sulfur.dioxide + sulphates, data = ww)
##
## ==================================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** -21.150*** -31.387*** -47.652*** -47.870*** -43.543*** 41.731*** 42.639*** 37.700*** 47.757***
## (0.098) (6.165) (6.162) (6.355) (6.195) (6.222) (6.510) (11.223) (11.284) (11.294) (11.437)
## alcohol 0.313*** 0.360*** 0.343*** 0.356*** 0.405*** 0.406*** 0.408*** 0.310*** 0.308*** 0.310*** 0.296***
## (0.009) (0.015) (0.015) (0.015) (0.015) (0.015) (0.015) (0.018) (0.019) (0.019) (0.019)
## density 24.728*** 23.671*** 34.437*** 50.909*** 51.237*** 46.805*** -39.975*** -40.902*** -36.049** -46.226***
## (6.079) (6.074) (6.293) (6.137) (6.199) (6.501) (11.351) (11.414) (11.422) (11.567)
## chlorides -2.382*** -2.421*** -1.323* -1.334* -1.399** -0.762 -0.808 -0.831 -0.818
## (0.558) (0.555) (0.539) (0.540) (0.541) (0.541) (0.544) (0.542) (0.541)
## fixed.acidity -0.087*** -0.101*** -0.103*** -0.103*** -0.027 -0.029 -0.020 -0.014
## (0.014) (0.014) (0.015) (0.015) (0.017) (0.017) (0.017) (0.017)
## volatile.acidity -2.085*** -2.088*** -2.112*** -2.117*** -2.101*** -1.981*** -1.953***
## (0.110) (0.111) (0.111) (0.110) (0.112) (0.114) (0.114)
## pH -0.031 -0.042 0.326*** 0.332*** 0.343*** 0.317***
## (0.081) (0.081) (0.090) (0.090) (0.090) (0.090)
## total.sulfur.dioxide 0.001* 0.000 0.000 -0.001 -0.001*
## (0.000) (0.000) (0.000) (0.000) (0.000)
## log(residual.sugar) 0.225*** 0.226*** 0.210*** 0.232***
## (0.024) (0.024) (0.024) (0.025)
## citric.acid 0.075 0.057 0.037
## (0.097) (0.096) (0.096)
## free.sulfur.dioxide 0.004*** 0.004***
## (0.001) (0.001)
## sulphates 0.502***
## (0.099)
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.195 0.202 0.256 0.256 0.257 0.270 0.270 0.274 0.278
## adj. R-squared 0.190 0.192 0.195 0.201 0.255 0.255 0.256 0.269 0.269 0.272 0.276
## sigma 0.797 0.796 0.795 0.792 0.764 0.764 0.764 0.757 0.757 0.755 0.754
## F 1146.395 583.290 396.315 309.222 336.912 280.734 241.554 225.827 200.787 184.336 170.797
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5822.011 -5802.684 -5629.932 -5629.861 -5627.322 -5584.491 -5584.187 -5570.814 -5557.825
## Deviance 3112.257 3101.773 3090.247 3065.956 2857.136 2857.053 2854.093 2804.611 2804.262 2788.992 2774.238
## AIC 11684.782 11670.255 11654.021 11617.368 11273.865 11275.722 11272.645 11188.982 11190.373 11165.629 11141.649
## BIC 11704.272 11696.241 11686.504 11656.348 11319.341 11327.694 11331.114 11253.948 11261.836 11243.588 11226.105
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ==================================================================================================================================================================================
No combination of variables gave a good model for predicting the quality score. The R-squared value is very low even after including all variables (0.278 for m11), so the fit is weak.
In this section, I tried to visualise some of the variables more concisely and precisely. Some of the insights into the relationships between alcohol, density and residual sugar were strengthened.
It is interesting to note that the trends in chemical properties for wines of quality 5 and below are almost the inverse of the trends for wines of quality 6 and above. This might be due to the influence of an unknown variable that is not in the dataset. Alternatively, there might be something that I have missed. The use of artificial flavouring and other chemical agents might give low quality wines the same chemical properties but different tastes.
I tried to fit a linear model to the dataset to predict the quality of white wine from the features provided.
The model grew slightly stronger as I added more features. However, a linear model may not be the best way to represent this data. R-squared stayed low (0.278 with all eleven features) and the residual standard error stayed high (0.754). Using all the features is not very different from using alcohol alone (R-squared 0.190), which was tried in the bivariate section. This might be because some of the features are correlated with each other.
To improve the model, we might need to introduce new features or find a new way to transform the data. Moreover, a method other than linear regression might predict quality better.
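One way to probe the suspicion that the predictors are correlated with each other is the variance inflation factor (VIF), which was not computed in the original analysis. The sketch below uses base R only; `vif_base` is a hypothetical helper that regresses each predictor on the others and returns 1 / (1 - R-squared). It uses the untransformed `residual.sugar` column for simplicity, which is an assumption of this example rather than the model fitted above.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# computed by regressing it on all the other predictors.
vif_base <- function(df, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Usage with the wine data, if it has been loaded as `ww`
if (exists("ww")) {
  predictors <- c("alcohol", "density", "chlorides", "fixed.acidity",
                  "volatile.acidity", "pH", "total.sulfur.dioxide",
                  "residual.sugar", "citric.acid", "free.sulfur.dioxide",
                  "sulphates")
  print(round(vif_base(ww, predictors), 2))
}
```

A VIF near 1 means a predictor is nearly independent of the rest. Values above roughly 5 to 10 are a common rule of thumb for collinearity strong enough to make individual coefficients unstable, which would be consistent with the sign flip of the density coefficient between m7 and m8.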
saq <- ggplot(aes(x = alcohol, y = quality), data = ww) +
geom_jitter(alpha = 1/10) +
xlab("Alcohol level (% by volume)") +
ylab("Quality score (0 to 10)") +
ggtitle("Scatterplot") +
scale_x_continuous(breaks = seq(8,14,1))
# Treat quality as a factor so that one box is drawn per quality score
bqa <- ggplot(aes(x = factor(quality), y = alcohol), data = ww) +
geom_boxplot() +
xlab("Quality score (0 to 10)") +
ylab("Alcohol level (% by volume)") +
ggtitle("Boxplot") +
scale_y_continuous(breaks = seq(8,14,1))
grid.arrange(saq,bqa)
The strongest correlation between the feature of interest (quality) and any other feature was with alcohol, at 0.436. The charts above visualise this relationship. In the scatterplot, the concentration of points rises from left to right, meaning that quality tends to increase with alcohol level. A closer look at the box plot shows that the trend is not steady: between quality 3 and 5 the relationship is negative. A plausible conjecture, not tested here, is that above about 12.5% alcohol the quality of wine would decrease because the alcohol taste overpowers the native wine taste.
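The correlation of 0.436 and the R-squared of 0.190 reported for model m1 are two views of the same quantity: for a linear model with a single predictor, R-squared equals the squared Pearson correlation, and 0.436 squared is about 0.190. The sketch below illustrates the identity; `simple_fit_summary` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: Pearson correlation and one-predictor R-squared
# for the same pair of variables, returned side by side.
simple_fit_summary <- function(x, y) {
  fit <- lm(y ~ x)
  c(r = cor(x, y), r_squared = summary(fit)$r.squared)
}

# Usage with the wine data, if it has been loaded as `ww`
if (exists("ww")) {
  out <- simple_fit_summary(ww$alcohol, ww$quality)
  print(round(out, 3))
  # The squared correlation should agree with R-squared up to rounding
  print(all.equal(unname(out["r"])^2, unname(out["r_squared"])))
}
```

This is why alcohol alone explains only about 19% of the variance in quality even though it is the single best predictor.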
ggplot(aes(x = density, y = alcohol, color = as.factor(quality)), data = ww) +
geom_point() +
scale_color_brewer(type = 'qual') +
xlim(quantile(ww$density, 0.01),
quantile(ww$density, 0.99)) +
xlab("Density (g / cm^3)") +
ylab("Alcohol (% by volume)") +
ggtitle("Alcohol vs. Density with Color set by Quality")
## Warning: Removed 98 rows containing missing values (geom_point).
This is a good visualisation of the relationship between alcohol, density and quality. I trimmed the lowest and highest 1% of density values to make the visualisation clearer. However, for some reason I was not able to color the visualisation.
Alcohol and density have a negative relationship: as alcohol content increases, density decreases. Also, the better quality wines are concentrated at the top left of the graph. The points disperse in the middle and converge at the bottom right. This also hints that as density increases, the quality of wine tends to decrease.
The White Wines dataset contains information on 4,898 samples of Portuguese white wine (Vinho Verde) across 11 chemical properties, plus a quality score assigned by wine experts. I started by exploring individual variables in the dataset and went on to investigate the relationship between each chemical property and quality, which I chose as the main feature in my analysis. Eventually, I tried to create a linear model to predict the quality of a wine from its chemical properties.
There was a trend between quality and alcohol, but the other variables did not correlate strongly with quality. However, several variables were fairly strongly correlated with each other. This might also be why I was not able to come up with a linear model that predicts the quality score well. Transformations might have helped, but I could not identify a direction to pursue. Alternatively, the absence of other features in the dataset might be a reason why I wasn’t able to produce a good linear model in my analysis.
Some limitations of this data include missing features such as glycerol, tannin, amino acids and minerals. Another limitation is that the quality score is a very subjective indicator. A more robust dataset could have produced a better model.
Having said that, this is my first project in R. I have much to learn, and I am sure that as the course progresses I will be able to deliver better work.